Using lateral joins for business analytics in CockroachDB 20.1

Using lateral joins for business analytics in CockroachDB 20.1

CockroachDB’s 20.1 release supports lateral joins, which can dramatically reduce the lines of code needed to arrive at key business analytics. Today, I’m going to walk you through a demo of lateral joins and showcase how you might use them to run analytic queries directly in CockroachDB.

What’s a lateral join?

A lateral join is a type of correlated subquery that allows data sources to reference columns provided by previous data sources in the same query. In practice, this can be used to iterate over each row of a result, similar to a nested foreach loop. We will walk through an example below using CockroachDB 20.1 and our fictional ride-sharing company, MovR.

Cluster Setup

First, let’s set up CockroachDB 20.1 using your preferred method including Linux, Mac, or Windows binaries, Docker images, or building directly from source.

This example will focus on the MovR rides table that is defined as:

CREATE TABLE rides (
                  id UUID NOT NULL,
                  city VARCHAR NOT NULL,
                  vehicle_city VARCHAR NULL,
                  rider_id UUID NULL,
                  vehicle_id UUID NULL,
                  start_address VARCHAR NULL,
                  end_address VARCHAR NULL,
                  start_time TIMESTAMP NULL,
                  end_time TIMESTAMP NULL,
                  revenue DECIMAL(10,2) NULL                   
                      )

Note, I’ve simplified this table a bit in case you want to follow along and copy-paste this table and the subsequent queries into CockroachDB. You can also use the new CockroachDB demo feature that we recently blogged about to try this out in less than five minutes.

To use CockroachDB demo (and the more complex rides table with pre-populated data) enter the following command:

./cockroach demo

This command will automatically load the MovR tables and data:

show tables;

table_name
+----------------------------+
  promo_codes
  rides
  user_promo_codes
  users
  vehicle_location_histories
  vehicles
(6 rows)

MovR OKRS: A Lateral Join Example

The executive team at MovR has set a number of important Objectives and Key Results (OKRs) for MovR. The top objective for this year is to increase riders who use MovR. To do that, the executive team determined that the most important key result is to lower the average time between customers’ first ride and their second ride. This is an important metric because it helps customers build a habit of using MovR and increases the lifetime value of the customer, a key input into profitability.

MovR uses CockroachDB as its OLTP system of record and it keeps track of every ride for each user. While we can export any data from CockroachDB via change data capture (CDC) into an analytics oriented database, in this case, we don’t need to as we can derive this answer directly in CockroachDB using SQL’s Lateral Join syntax.

Eventually, we will have a fairly complex 25 line query. Before we get to the full query, let’s start in the middle and work our way out. First, we need to calculate the first time a user took a ride with MovR:

SELECT
    rider_id, 
    1 AS first_ride,
    min(start_time) AS first_ride_time
  FROM rides
  GROUP BY rider_id 

This query finds the minimum start time of a ride for a given rider_id and alias it to the variable first_ride_time. It also keeps track of whether or not a user has completed a first ride by entering either 1 or null into a first_ride column.

Next, we need to determine if the customer took as second ride, and, if so, determine how long after the first ride that the customer took their second ride:

SELECT
  1 AS second_ride,
  start_time AS second_ride_time
FROM rides
WHERE
  rider_id = r1.rider_id AND
  start_time > first_ride_time 
ORDER BY start_time
LIMIT 1

Similarly to above, we are calculating a second_ride_time variable based on start_time. However, this time we require the start time to be greater than the variable we previously defined as first_ride_time. This ensures that we get our second ride time.

Altogether, we can pair these queries into our larger query:

SELECT 
  rider_id,
  first_ride,
  first_ride_time,
  second_ride,
  second_ride_time
FROM (
    --get the first ride time
 SELECT
    rider_id,
    1 AS first_ride,
    min(start_time) AS first_ride_time
  FROM rides
  GROUP BY rider_id
) r1 LEFT JOIN LATERAL (
    -- Get the second ride time 
SELECT
  1 AS second_ride,
  start_time AS second_ride_time
FROM rides
WHERE
  rider_id = r1.rider_id AND
  start_time > first_ride_time 
ORDER BY start_time
LIMIT 1
) r2 ON true

This will produce a table that looks like:


	                   rider_id    | first_ride | first_ride_time                  | second_ride | second_ride_time
+--------------------------------------+------------+----------------------------------+-------------+----------------------------------+
  04553af9-dc34-461a-95fa-8c9e7123a613 |          1 | 2019-11-13 16:10:00.627167+00:00 |           1 | 2019-11-13 16:12:59.356821+00:00
  77cacc32-57dc-4f52-83a4-d50a9a65b8bd |          1 | 2019-11-13 16:06:33.618505+00:00 |           1 | 2019-11-13 16:06:37.002664+00:00
  c2c25f39-d2a0-480f-a5bb-36ec5436e34f |          1 | 2019-11-13 16:35:06.696277+00:00 |        NULL | NULL
  418c9765-4a04-4728-9f5d-981d3a6e9c5d |          1 | 2019-11-13 16:06:28.37234+00:00  |           1 | 2019-11-13 16:19:33.451015+00:00
  4be831d4-9115-412b-b2c4-dceb92c81e10 |          1 | 2019-11-13 16:15:32.557127+00:00 |           1 | 2019-11-13 16:20:33.258865+00:00

This is great, but we can do even better by directly making the comparison within CockroachDB using the avg comparison on the two defined variables: first_ride_time and second_ride_time.

 SELECT 
  avg(second_ride_time::DECIMAL - first_ride_time::DECIMAL) as average_time_to_second_ride
FROM (
    --get the first ride time
 SELECT
    rider_id,
    1 AS first_ride,
    min(start_time) AS first_ride_time
  FROM rides
  GROUP BY rider_id
) r1 LEFT JOIN LATERAL (
    -- Get the second ride time 
SELECT
  1 AS second_ride,
  start_time AS second_ride_time
FROM rides
WHERE
  rider_id = r1.rider_id AND
  start_time > first_ride_time 
ORDER BY start_time
LIMIT 1
) r2 ON true

This will produce a result like: 
  average_time_to_second_ride
+-----------------------------+
        38690.915436623268698
(1 row)
 
Time: 16.576ms

*Note, if you are following along at home you may see a different final result because we generate random data in the MovR application.

It this case it takes about 10 hours for our riders to complete their second ride. Now we have actionable data as a baseline for our OKR. We can measure the success of our efforts to lower the time to second ride as a key result moving forward. It may also begin to clue us in on a specific pattern--MovR customers may be using MovR to commute to and from work, a time of approximately 10 hours on average.

Try Lateral Joins in CockroachDB 20.1

CockroachDB’s 20.1 release now supports lateral joins, which can dramatically reduce the lines of code needed to arrive at key business analytics. It opens up a whole new world of possibilities for application developers to enhance the power of correlated subqueries. As we show above, a LEFT JOIN to a LATERAL subquery can provide the source rows in the result even if the LATERAL subquery produces no rows for them. To try out lateral joins for yourself, test out CockroachDB 20.1!

 

Keep Reading

Join ordering, part II: The 'SQL'

Even in the 80’s, before Facebook knew everything there was to know about us, we as an industry had …

Read more
Peewee ORM + CockroachDB

This article was originally posted on the personal blog of the Peewee ORM founder, Charles Leifer. Peewee is a …

Read more
Query plan caching in CockroachDB

Since the 2.1 release, CockroachDB has had a cost-based optimizer. Rewriting a big component of an existing …

Read more